HistSearch - Implementation and Evaluation of a Web-based Tool for Automatic Information Extraction from Historical Text

Authors

  • Eva Pettersson
  • Jonas Lindström
  • Benny Jacobsson
  • Rosemarie Fiebranz
Abstract

Due to a lack of NLP tools adapted to the task of analysing historical text, historians and other researchers in the humanities often need to search manually through large volumes of text to find information relevant to their research. In this paper, we present a web-based tool for automatic information extraction from historical text, with the aim of facilitating this time-consuming process. We describe 1) the underlying architecture of the system, based on spelling normalisation followed by tagging and parsing using tools available for the modern language, 2) a prototypical graphical user interface used by the historians, and 3) a thorough manual evaluation of the tool performed by the actual users, i.e. the historians, when applied to the specific task of extracting and presenting verb phrases describing work in Early Modern Swedish text. The main contribution is the manual evaluation, which takes both quantitative and qualitative aspects into account and is compared to automatic evaluation results. We show that spelling normalisation is successful for the task of tagging and lemmatisation, meaning that the words analysed as verbs by the tool are mostly considered verbs by the historians as well. We also point out the further work needed to improve parsing and ranking performance, in order to make the tool genuinely useful in the extraction process.
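The architecture outlined in the abstract (spelling normalisation first, then tagging and lemmatisation with modern-language tools, then verb extraction) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `NORMALISATION`, `LEXICON`, `normalise`, `tag` and `extract_verb_lemmas` are hypothetical, the dictionary lookup stands in for whatever normalisation method the tool actually uses, and the word forms are toy entries, not data from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical toy mapping from historical spellings to modern Swedish forms.
NORMALISATION: Dict[str, str] = {"kiöpte": "köpte", "hafwer": "har"}

# Hypothetical toy lexicon standing in for a modern-language tagger/lemmatiser:
# modern word form -> (part of speech, lemma).
LEXICON: Dict[str, Tuple[str, str]] = {
    "köpte": ("VERB", "köpa"),
    "har": ("VERB", "ha"),
    "hon": ("PRON", "hon"),
    "fisk": ("NOUN", "fisk"),
}


def normalise(tokens: List[str]) -> List[str]:
    """Replace known historical spellings; leave other tokens unchanged."""
    return [NORMALISATION.get(token, token) for token in tokens]


def tag(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Return (form, pos, lemma); unknown forms get pos 'X' and lemma = form."""
    tagged = []
    for token in tokens:
        pos, lemma = LEXICON.get(token, ("X", token))
        tagged.append((token, pos, lemma))
    return tagged


def extract_verb_lemmas(text: str) -> List[str]:
    """Lowercase, split on whitespace, normalise, tag, and keep verb lemmas."""
    tokens = normalise(text.lower().split())
    return [lemma for _, pos, lemma in tag(tokens) if pos == "VERB"]


if __name__ == "__main__":
    print(extract_verb_lemmas("Hon kiöpte fisk"))
```

Without the normalisation step, the historical form `kiöpte` is unknown to the toy lexicon and is not recognised as a verb, which is the motivation for normalising before applying modern-language tools.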

Similar resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS

Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and word-net, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

A survey on Automatic Text Summarization

Text summarization endeavors to produce a summary version of a text while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to sift through such a massive amount of data in order to extract the useful information is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...

Improving automatic summarization of Persian texts using natural language processing methods and a similarity graph

A significant amount of available information is stored in textual databases, which contain large collections of documents from different sources (such as news, articles, books, emails and web pages). The increasing visibility and importance of this class of information motivates us to work on having better automatic evaluation tools for textual resources. The automatic summarization of tex...

Publication date: 2016